1. Load the Dataset
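The dataset can be loaded with pandas. The file name `parkinsons.csv` is an assumption; a tiny inline sample stands in for the file here so the sketch is self-contained:

```python
import io
import pandas as pd

# The file name "parkinsons.csv" is an assumption; a tiny inline sample
# stands in for it here so the snippet runs on its own.
csv_text = (
    "name,MDVP:Fo(Hz),MDVP:Fhi(Hz),status\n"
    "phon_R01_S01_1,119.992,157.302,1\n"
    "phon_R01_S01_2,122.400,148.650,1\n"
)
p_data = pd.read_csv(io.StringIO(csv_text))  # in practice: pd.read_csv("parkinsons.csv")
print(p_data.shape)
print(p_data.head())
```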

2. It is always good practice to eyeball the raw data to get a feel for it in terms of the number of records, the structure of the file, and the number of attributes. A few comments in this regard are given below.

Applying EDA to the dataset, we can make the following inferences about the columns:

MDVP:Fo(Hz) - Average vocal fundamental frequency. The values range from a minimum of 88 to a maximum of 260, with a mean of 154 and a median of 148. As the mean is greater than the median, we can infer that the distribution is skewed to the right, with a tail stretching towards the right. A few outliers are likely, as the 99.7% range differs from the min-max range of the values.
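The 99.7% range used throughout this section is the three-sigma interval (mean ± 3·std). A minimal sketch of the check, using a synthetic right-skewed stand-in for the `MDVP:Fo(Hz)` column (values illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed stand-in for p_data["MDVP:Fo(Hz)"] (values illustrative).
rng = np.random.default_rng(0)
col = pd.Series(rng.lognormal(mean=5.0, sigma=0.2, size=195))

mean, std = col.mean(), col.std()
low, high = mean - 3 * std, mean + 3 * std   # the "99.7%" (three-sigma) range
print(f"mean={mean:.2f} median={col.median():.2f}")
print(f"99.7% range: [{low:.2f}, {high:.2f}]  min={col.min():.2f} max={col.max():.2f}")
print("values outside the range:", ((col < low) | (col > high)).sum())
```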

MDVP:Fhi(Hz) - Maximum vocal fundamental frequency. The mean (197.10) is far greater than the median (148.79), so the distribution should be skewed to the right. As seen below, the 99.7% range differs hugely from the min-max range, so we can expect a number of outliers on the positive side.

MDVP:Flo(Hz) - Minimum vocal fundamental frequency. The mean is 116.32 and the median is 104.32, which again shows high positive skewness. From the calculation below we can see that the max and min values differ considerably from the 99.7% lower and upper bounds, which suggests many outliers on the positive side.

MDVP:Jitter(%) - Measure of variation in fundamental frequency. The mean and median are similar, implying a roughly symmetric distribution. Again, there is a lot of variation between the min-max range of the data and the 99.7% lower and upper bounds, implying outliers on the right side, as shown in the calculations below.

MDVP:Jitter(Abs) - The mean and median values are close, implying a roughly symmetric distribution. The min-max range is very different from the 99.7% bounds, which implies many outliers on the right side.

MDVP:RAP - Comparing the mean and the median, we can see that the values are right-skewed. Looking at the 99.7% bounds and the min-max range, many outliers can be expected.

MDVP:PPQ - The mean and median differ, and the min and max values are quite different from the 99.7% bounds. We can infer that the data is positively skewed, with many outlier values.

Jitter:DDP - Comparing the mean and the median, we can see that the values are right-skewed. Looking at the 99.7% bounds and the min-max range, many outliers can be expected.

MDVP:Shimmer - Not much difference between the mean and median. The difference between the min-max values and the 99.7% lower and upper bounds suggests a skew on the positive side.

MDVP:Shimmer(dB) - The mean and median differ, implying positive skewness. The min-max values and the 99.7% bounds given below show a lot of difference, implying many outliers. The same is confirmed by the boxplot below.

Shimmer:APQ3 - Not much difference between the mean and median, but the 99.7% bounds and the min-max values are quite different, suggesting positive skewness.

```python
import seaborn as sns

sns.boxplot(x=p_data["Shimmer:APQ3"])
```

Shimmer:APQ5 - The mean and median differ, and the min-max values compared to the 99.7% bounds indicate positive skewness.

MDVP:APQ - The mean and median differ, and there is a lot of difference between the min-max values and the 99.7% bounds, which suggests positive skewness. The same is confirmed by the boxplot below.

Shimmer:DDA - There is a difference between the mean and median at this scale. The min-max values and the 99.7% lower and upper bounds show quite a bit of difference, which suggests positive skewness in the data. The same is confirmed by the boxplot below.

NHR - There is a difference between the mean and median at this scale. The min-max values and the 99.7% lower and upper bounds show quite a bit of difference, which suggests positive skewness in the data. The same is confirmed by the boxplot below.

HNR

RPDE - The mean and median are almost the same. There are very few outliers, on the left side.

DFA

spread1

spread2

D2

PPE

Most of the data is of float type (continuous), with name as an object type and status as the categorical, label-encoded variable. Status is the value we need to predict and is the dependent variable; it takes two values, 0 and 1, signifying healthy and Parkinson's-afflicted people respectively. The name attribute adds no value to the machine learning analysis and can be deleted from the dataset.

No Null data found
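The null check can be done with pandas' `isnull()`; a sketch against a hypothetical three-row stand-in for `p_data`:

```python
import pandas as pd

# Hypothetical three-row stand-in for p_data (column names from the dataset).
p_data = pd.DataFrame({
    "MDVP:Fo(Hz)": [119.992, 122.400, 116.682],
    "MDVP:Jitter(%)": [0.00784, 0.00968, 0.01050],
    "status": [1, 1, 1],
})
null_counts = p_data.isnull().sum()   # nulls per column
print(null_counts)
print("total nulls:", null_counts.sum())
```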

Checking for skewness in data

As shown in the individual column analysis, and confirmed by the skew function applied to the dataset, almost all the columns are skewed.
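The column-wise skewness comes from pandas' `skew()`; a sketch with synthetic symmetric and right-skewed columns standing in for the real features:

```python
import numpy as np
import pandas as pd

# Synthetic columns: one symmetric, one right-skewed (stand-ins for real features).
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "symmetric": rng.normal(size=500),
    "right_skewed": rng.exponential(size=500),
})
skews = frame.skew()   # per-column sample skewness
print(skews)
```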

Relationship of Status with other column averages

Status 0 implies a healthy person and status 1 a person afflicted with Parkinson's disease.
A comparative study of all the columns for healthy and Parkinson's-afflicted people is given above.

Checking outliers for each column

Most columns have outliers; the number of outliers for each column is printed above. This is one of the challenges we will have to address as we work through the dataset.
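The per-column outlier counts quoted in this report can be computed with the 1.5×IQR rule; a sketch over a synthetic heavy-tailed stand-in column (in the notebook this would loop over `p_data`'s numeric columns):

```python
import numpy as np
import pandas as pd

# Synthetic heavy-tailed column; the notebook loops over p_data's numeric columns.
rng = np.random.default_rng(1)
df = pd.DataFrame({"NHR": rng.exponential(scale=0.03, size=195)})

for col in df.columns:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 IQRs from the quartiles are flagged as outliers.
    mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    print(f"no. of outliers are {mask.sum()} for the column {col}")
```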

The proportion of people with Parkinson's is much higher than that of people without. Ideally, a balanced dataset is best suited for machine learning analysis, so we might have to balance the classes.
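The class balance can be checked with `value_counts()`. The counts below are the usual split of the UCI Parkinson's data (147 PD vs 48 healthy), used here as an assumed stand-in:

```python
import pandas as pd

# Assumed status counts mirroring the UCI Parkinson's data: 147 PD, 48 healthy.
status = pd.Series([1] * 147 + [0] * 48, name="status")
counts = status.value_counts()
print(counts)
print("PD ratio:", round(counts[1] / len(status), 2))
```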

Overall, the key challenges are:

  1. skewness in the data
  2. bias (class imbalance) in the data
  3. outlier values

3. Using univariate & bivariate analysis to check the individual attributes for their basic statistics such as central values, spread, tails, and relationships

Box plot of all the features grouped by status to understand the distributions, spreads and tails

We plotted the individual boxplots in Section 2 and analysed the skewness and outliers present in the data.
Grouping the box plots by status gives a visual comparison of the spread across the statuses.
As we can see for the features above, the medians are quite different for each status,
and there is a considerable difference in the range of values and the spread
of the outliers for the same feature with different status. This indicates the difference in each of the parameter values
between a Parkinson's-afflicted person and a healthy person. We will carry out a more elaborate breakdown in the following sections.
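A grouped boxplot of this kind can be drawn by passing `status` as the x-axis; a sketch on synthetic data (the feature values are illustrative only):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the snippet runs without a display
import seaborn as sns

# Synthetic stand-in for p_data; feature values are illustrative only.
rng = np.random.default_rng(2)
p_data = pd.DataFrame({
    "status": np.repeat([0, 1], 50),
    "MDVP:Fo(Hz)": np.concatenate([rng.normal(180, 30, 50),
                                   rng.normal(145, 35, 50)]),
})
# One box per status class, side by side on the same axes.
ax = sns.boxplot(x="status", y="MDVP:Fo(Hz)", data=p_data)
```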

Skewness

Outliers

MDVP:Fo(Hz)

As can be seen from the box plots of Section 2 and the graphs, there is a bit of positive skewness in the data. The skewness values of all the columns are populated in the skewness section. There are no outliers, as can be seen from the box plot and the outlier calculation in this section. The distributions for PD and healthy people are compared in the KDE graphs.

MDVP:Fhi(Hz), MDVP:Flo(Hz), MDVP:Jitter(%)

MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA

The box plot distributions for the above features are explained in Section 2.
The graphs above depict a comparative study of the features with respect to status, that is, the distribution of people with PD vs without Parkinson's.
The outlier values are given below and have been presented for all the features at the beginning of the section.
The mean and median values, along with a comparison of the 99.7% range against the min and max values, are printed above the graph to give an idea of the distribution and skewness.
The exact outlier counts are:
no. of outliers are 8 for the column MDVP:Shimmer
no. of outliers are 10 for the column MDVP:Shimmer(dB)
no. of outliers are 6 for the column Shimmer:APQ3
no. of outliers are 13 for the column Shimmer:APQ5
no. of outliers are 12 for the column MDVP:APQ
no. of outliers are 6 for the column Shimmer:DDA
The skewness values of the fields are:
MDVP:Shimmer skewness is 1.6664804101559663
MDVP:Shimmer(dB) skewness is 1.999388639086127
Shimmer:APQ3 skewness is 1.5805763798815677
Shimmer:APQ5 skewness is 1.798697066537622
MDVP:APQ skewness is 2.618046502215422
Shimmer:DDA skewness is 1.5806179936782263

NHR, HNR

Below are the values for outliers and skewness
no. of outliers are 19 for the column NHR
no. of outliers are 3 for the column HNR
NHR skewness is 4.22070912913906
HNR skewness is -0.5143174975652068
The boxplot was plotted in Section 1.
HNR shows a bit of negative skewness, which is confirmed by the skewness value above; NHR is positively skewed, with the maximum number of observations between 0 and 0.04. The distribution of these features by status is also plotted above and indicates a sharp rise in the distribution at 25 for non-PD vs 20 for PD.

RPDE, D2

Skewness and outlier values
RPDE skewness is -0.14340241379821705
D2 skewness is 0.4303838913329283
no. of outliers are 0 for the column RPDE
no. of outliers are 1 for the column D2
For RPDE there are no outliers, and the KDE rug shows no clear separation between PD and non-PD.
For D2 there is one outlier in the complete box plot without segmentation, and two if we segment it into PD and non-PD. The values are very intermixed for the two status classes. The measure of skewness is indicated above with the relevant sign.

spread1, spread2, PPE

Skewness values:
spread1 skewness is 0.4321389320131796
spread2 skewness is 0.14443048549278412
PPE skewness is 0.7974910716463578
Outlier values:
no. of outliers are 4 for the column spread1
no. of outliers are 2 for the column spread2
no. of outliers are 5 for the column PPE

We can see a good separation of values between PD and non-PD in the above features; they might turn out to be good indicators.
The skewness and outlier values indicating the spread are given above, along with their signs.
The distribution segmented by status is plotted above, along with the KDE graph, which gives the density of values for each feature.

Bivariate/Multivariate analysis

We can see in the bar and cat plots above that the features below show quite a bit of difference in values when comparing PD and non-PD:
MDVP:Fo(Hz)
Jitter:DDP
MDVP:Shimmer
MDVP:Shimmer(dB)
MDVP:Jitter(Abs)
Shimmer:APQ3
Shimmer:APQ5
MDVP:PPQ
MDVP:APQ
NHR
PPE
MDVP:RAP
spread2
spread1

Correlation Analysis

We can see a lot of correlation between the features.
Some of them are highly correlated (more than .95). Let us find the features that are
well correlated with the target variable.

Correlation between target variable and columns

We can see that spread1 , spread2, PPE, D2, MDVP:Shimmer, MDVP:Shimmer(dB), Shimmer:APQ3, Shimmer:APQ5, MDVP:APQ, Shimmer:DDA, MDVP:Jitter(%) and MDVP:Jitter(Abs) are decently correlated with target variable.

Find the variables that are correlated with each other at .95 or above

The variable high_corr_columns holds the list of unique columns with correlation greater than or equal to .95; these columns can be dropped.
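One way to build such a list is to scan the upper triangle of the absolute correlation matrix; a sketch on a small synthetic frame (the column names `f1`-`f3` are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic frame where f2 is a near-copy of f1 and f3 is independent.
rng = np.random.default_rng(3)
a = rng.normal(size=200)
df = pd.DataFrame({
    "f1": a,
    "f2": a + rng.normal(scale=0.01, size=200),  # correlated ~1 with f1
    "f3": rng.normal(size=200),
})

corr = df.corr().abs()
# Keep only the strict upper triangle so each pair is counted once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
high_corr_columns = [c for c in upper.columns if (upper[c] >= 0.95).any()]
print(high_corr_columns)
```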

4. Split the dataset into training and test set in the ratio of 70:30
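A 70:30 split can be done with `train_test_split`; a sketch on a synthetic 195-row stand-in for the feature matrix and the status target:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic 195-row stand-in for the features X and the status target y.
X, y = make_classification(n_samples=195, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y)
print("train:", X_train.shape, "test:", X_test.shape)
```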

5. Prepare the data for training - Scale the data if necessary, get rid of missing values (if any) etc

Missing value and NaN checks were applied in an earlier section.
The data looks homogeneous across the columns, but domain expertise would be required for
more sophisticated erroneous-value checks.
Below we scale the data and feed it to the algorithms.
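The scaling step can be done with `StandardScaler`, fitted on the training data only; a minimal sketch (the values are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training matrix: two features on very different scales.
X_train = np.array([[88.0, 0.02], [154.0, 0.01], [260.0, 0.03]])
scaler = StandardScaler().fit(X_train)       # fit on the training data only
X_train_scaled = scaler.transform(X_train)
print(X_train_scaled.mean(axis=0).round(6))  # ~0 per feature after scaling
print(X_train_scaled.std(axis=0).round(6))   # 1 per feature after scaling
```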

6. Train at least 3 standard classification algorithms - Logistic Regression, Naive Bayes, SVM, k-NN etc., and note down their accuracies on the test data

Logistic Regression

KNN With Scaling

Naive Bayes Classification
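The three classifiers above can be trained and scored in one loop; a sketch on synthetic stand-in data (pipelines wrap scaling where the model benefits from it):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic 195-row stand-in for the prepared features and status target.
X, y = make_classification(n_samples=195, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "Logistic Regression": make_pipeline(StandardScaler(),
                                         LogisticRegression(max_iter=1000)),
    "KNN (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "Naive Bayes": GaussianNB(),
}
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)   # accuracy on the test set
    print(f"{name}: test accuracy {scores[name]:.2f}")
```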

Applying the algorithms after dropping the correlated columns from section 3
['MDVP:RAP', 'MDVP:PPQ', 'Jitter:DDP', 'MDVP:Shimmer(dB)', 'Shimmer:APQ3', 'Shimmer:APQ5', 'MDVP:APQ', 'Shimmer:DDA', 'PPE']

Logistic Regression

KNN

After deleting the correlated columns, an increase in accuracy can be seen for KNN.

Naive Bayes

There is an increase in accuracy for Naive Bayes after the correlated columns were dropped.

The KNN model with the correlated columns dropped gives the best accuracy.
Below are the scores for KNN:
Accuracy on train set: 0.96
Accuracy on test set: 0.93
Recall score: 0.93
ROC AUC score: 0.94
Precision score: 0.97
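The scores above come from scikit-learn's metric helpers; a toy example with assumed labels showing how each is computed:

```python
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Toy true/predicted labels (assumed) to illustrate how the scores are computed.
y_true = [0, 1, 1, 0, 1, 1]
y_pred = [0, 1, 1, 0, 0, 1]

recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3/4
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3/3
auc = roc_auc_score(y_true, y_pred)
print(f"Recall score: {recall:.2f}")
print(f"Precision score: {precision:.2f}")
print(f"ROC AUC score: {auc:.2f}")
```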

7. Train a meta-classifier and note the accuracy on test data

The meta-classifier achieves an accuracy of 96.32%, which is almost equivalent to the KNN model with dropped correlated columns that we found above.
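A stacking meta-classifier can be built with scikit-learn's `StackingClassifier`; a sketch using KNN and Naive Bayes as base learners on synthetic stand-in data (the exact base models used in the notebook are an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the base estimators here are an assumption.
X, y = make_classification(n_samples=195, n_features=10, random_state=0)
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # meta-learner
    cv=5,
)
stack.fit(X, y)
acc = stack.score(X, y)
print(f"stacking accuracy: {acc:.2f}")
```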

8. Train at least one standard Ensemble model - Random forest, Bagging, Boosting etc., and note the accuracy

Random Forest Classifier

We tried to find the best split values for the random forest but ran into system limits; a comparison of runs over the different hyperparameter settings will give us the best hyperparameters.
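The hyperparameter comparison can be kept within system limits by restricting the grid; a sketch with `GridSearchCV` over a small assumed grid, on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the grid below is a small assumed example.
X, y = make_classification(n_samples=195, n_features=10, random_state=0)
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X, y)
print("best params:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.2f}")
```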

9. Compare all the models (minimum 5) and pick the best one among them

Following are the best scores for each of the models
1. Logistic regression (with scaling)
Accuracy on train set: 0.90
Accuracy on test set: 0.85
Recall score: 0.97
ROC AUC score: 0.78
Precision score: 0.83

2. KNN
Accuracy on train set: 0.96
Accuracy on test set: 0.93
Recall score: 0.93
ROC AUC score: 0.94
Precision score: 0.97

3. Naive Bayes
Accuracy on test set: 0.73
Recall score: 0.78
ROC AUC score: 0.70
Precision score: 0.82

4. Meta Classifier - Stacking classifier
Training accuracy: 89.71
Testing accuracy: 84.75
ROC AUC: 77.70
Correct predictions: 50
Incorrect predictions: 9

5. Random Forest Classifier
Accuracy on train set: 1.00
Accuracy on test set: 0.83
Recall score: 0.97
ROC AUC score: 0.78
Precision score: 0.83

The best performance is given by the KNN classifier. The least accurate is the Naive Bayes classifier.

In terms of recall, the Random Forest Classifier has performed extremely well. Its accuracy is pretty good as well, along with a good ROC/AUC score.

Conclusion

Based on different studies on Parkinson's data and the way the features
are extracted, different models perform well. For example, in one study
(https://iopscience.iop.org/article/10.1088/1742-6596/1372/1/012041/pdf)
the Random Forest performed well,
whereas in another (https://scialert.net/abstract/?doi=jas.2014.171.176)
the KNN performed well.
In the study conducted here, I chose to delete the correlated columns with
a correlation index of .95 or higher, which produced the above results.